Assignment 2

Author

Jared Tavares and Heeeyletje van Zyl

Abstract

Introduction

The field of Natural Language Processing (NLP) encompasses techniques tailored to theme tracking and opinion mining, both of which form part of text analysis. Of particular prominence is the extraction of latent thematic patterns and the quantification of the emotionality expressed in political texts.

Given such political context, it is of specific interest to analyse the annual State of the Nation Address (SONA) speeches delivered by six different South African presidents (F.W. de Klerk, N.R. Mandela, T.M. Mbeki, K.P. Motlanthe, J.G. Zuma, and M.C. Ramaphosa) spanning twenty-nine years (1994 to 2023). This analysis, descriptive and data-driven in nature, examines the content of the SONA speeches in terms of themes via topic modelling (TM) and emotions via sentiment analysis (SentA). In general, as illustrated in Figure, the exploration proceeds along two axes, executing the aforementioned techniques at a macro and a micro scale, both at the text level (all presidents' SONA speeches combined versus per president) and at the token level (sentences versus words).

Schematic representation of sentiment analysis and topic modelling investigated at different scales and levels.

Through such a multi-layered lens, trends in both topics and sentiments can be identified over time at a large scale (the presidents as a collective) as well as at a small scale (each president as an individual). This provides not only an aggregated perspective of the general political discourse prevailing within South Africa, but also a more focused view of the specific rhetoric employed by each of the country's serving presidents during different periods.

To achieve all of the above-mentioned analysis, it is first relevant to revise foundational terms and review related literature in the context of politics and NLP. All pertinent pre-processing of the political text data is then considered, followed by a discussion delving into the details of each SentA and TM approach applied as part of the analysis. Specifically, three different lexicons are leveraged to describe sentiments, whilst five different topic models are tackled to uncover themes within the South African presidents' SONA speeches. Following the implementation of these methodologies, the results are detailed in terms of insights and interpretations. Thereafter, an overall evaluation of the techniques in terms of efficacy and inadequacy is given. Finally, focal findings are highlighted and potential improvements as part of future research are recommended.

Methods

Topic modelling

Latent Semantic Analysis (LSA)

LSA (Deerwester et al. 1990) is a non-probabilistic, non-generative model in which a form of matrix factorization is utilized to uncover a small number of latent topics, capturing meaningful relationships among documents and tokens. As depicted in Figure, in the first step a document-term matrix (DTM) is generated from the raw text data by tokenizing d documents into w words (or sentences), forming the columns and rows respectively. Each row-column entry is valued via either the bag-of-words (BoW) or tf-idf approach. This DTM, which is often sparse and high-dimensional, is then decomposed via a dimensionality-reduction technique, namely truncated Singular Value Decomposition (SVD). Consequently, in the second step the DTM becomes the product of three matrices: the topic-word matrix \(A_{t^*}\) (for the tokens), the topic-prevalence matrix \(B_{t^*}\) (for the latent semantic factors), and the transposed document-topic matrix \(C^{T}_{t^*}\) (for the documents). Here \(t^*\), the optimal number of topics, is a hyperparameter tuned (via either the Silhouette-coefficient or the coherence-measure approach) to a value that retains the most significant dimensions in the transformed space. In the final step, the text data is encoded using this optimal topic number. Given LSA only requires a DTM, its implementation is generally efficient. However, the involvement of truncated SVD introduces some computational intensity and prevents quick updates with new, incoming text data. Additional LSA drawbacks include: a lack of interpretability, the underlying linear-model framework (which results in poor performance on text data with non-linear dependencies), and the underlying Gaussian assumption for tokens in documents (which may not be an appropriate distribution).

Probabilistic Latent Semantic Analysis (pLSA)

Instead of implementing truncated SVD, pLSA (Hofmann 1999) utilizes a generative, probabilistic model. Within this framework, a document d is first selected with probability P(d). Given this document, a latent topic t is then chosen with probability P(t|d). Finally, given this chosen topic t, a word w (or sentence) is generated with probability P(w|t), as shown in Figure. Note that the value of P(d) is determined directly from the corpus D, which is defined in terms of a DTM. In contrast, the probabilities P(t|d) and P(w|t) are parameters modelled as multinomial distributions and iteratively updated via the Expectation-Maximization (EM) algorithm. A direct parallel between LSA and pLSA can be drawn via the methods' parameterization, as conveyed by the matching colours of the topic-word matrix and P(w|t), the document-topic matrix and P(d|t), and the topic-prevalence matrix and P(t) displayed in Figure and Figure, respectively.

Despite pLSA implicitly addressing the LSA-related disadvantages, the method still has two main drawbacks. There is no probability model for the document-topic probabilities P(t|d), resulting in an inability to assign topic mixtures to new, unseen documents. The number of model parameters also grows linearly with the number of documents, making the method more susceptible to overfitting.
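These multinomial parameters and their EM updates can be sketched on a toy count DTM (numpy only; the dimensions and random counts are illustrative assumptions, not the assignment's actual data or implementation):

```python
# Toy pLSA fitted with EM: iteratively re-estimate P(t|d) and P(w|t).
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_words = 4, 2, 6
dtm = rng.integers(1, 5, size=(n_docs, n_words))   # n(d, w) counts

# Random initial parameters, normalized into valid multinomials.
p_t_d = rng.random((n_docs, n_topics))
p_t_d /= p_t_d.sum(axis=1, keepdims=True)
p_w_t = rng.random((n_topics, n_words))
p_w_t /= p_w_t.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: responsibilities P(t|d,w) proportional to P(t|d) * P(w|t)
    post = p_t_d[:, :, None] * p_w_t[None, :, :]   # shape (d, t, w)
    post /= post.sum(axis=1, keepdims=True)
    weighted = dtm[:, None, :] * post              # n(d,w) * P(t|d,w)
    # M-step: re-estimate both multinomial parameter sets
    p_t_d = weighted.sum(axis=2)
    p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    p_w_t = weighted.sum(axis=0)
    p_w_t /= p_w_t.sum(axis=1, keepdims=True)

print(p_t_d.round(2))
```

Note how the parameter count (one topic mixture per document plus one word distribution per topic) grows with every document added, which is exactly the overfitting risk described above.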

Latent Dirichlet Allocation (LDA)

Schematic representation of LDA.

LDA is another generative, probabilistic model which can be viewed as a hierarchical Bayesian version of pLSA. By explicitly defining a generative model for the document-topic probabilities, both of the above-mentioned pitfalls of pLSA are addressed: the number of parameters to estimate decreases drastically, and the model can be applied and generalized to new, unseen documents. As presented in Figure, the first step involves randomly sampling a document-topic probability distribution (\(\theta\)) from a Dirichlet (Dir) distribution (\(\eta\)), followed by randomly sampling a topic-word probability distribution (\(\phi\)) from another Dirichlet distribution (\(\tau\)) in the second step. From the \(\theta\) distribution, a topic t is selected by drawing from a multinomial (Mult) distribution (third step), and from the \(\phi\) distribution given said topic t, a word w (or sentence) is sampled from another multinomial distribution (fourth step). The associated LDA parameters are then estimated via a variational expectation-maximization algorithm or collapsed Gibbs sampling.
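The four generative steps above can be sketched end-to-end for a single toy document (numpy only; the vocabulary, symmetric priors \(\eta\) and \(\tau\), and document length are illustrative assumptions):

```python
# Sampling one toy document from the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["government", "economy", "jobs", "education", "health"]
n_topics, doc_len = 2, 8

eta = np.ones(n_topics)    # symmetric Dirichlet prior for document-topics
tau = np.ones(len(vocab))  # symmetric Dirichlet prior for topic-words

theta = rng.dirichlet(eta)               # step 1: document-topic distribution
phi = rng.dirichlet(tau, size=n_topics)  # step 2: topic-word distributions

words = []
for _ in range(doc_len):
    t = rng.choice(n_topics, p=theta)    # step 3: draw a topic for this slot
    w = rng.choice(len(vocab), p=phi[t]) # step 4: draw a word from that topic
    words.append(vocab[w])

print(words)
```

Fitting LDA inverts this process: given only the observed words, the posterior over \(\theta\) and \(\phi\) is approximated (via variational EM or collapsed Gibbs sampling, as noted above).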

Correlated Topic Model (CTM)

Following closely from LDA, the CTM (Lafferty and Blei 2005) additionally allows for modelling correlated topics. Such topic correlations are introduced via the inclusion of a multivariate normal (MultNorm) distribution with a length-t vector of means (\(\mu\)) and a t \(\times\) t covariance matrix (\(\Sigma\)), whose resulting values are then mapped into probabilities via a logistic transformation. Comparing Figure and Figure, the nuance between LDA and CTM is highlighted in light blue: the discrepancy arises from replacing the Dirichlet distribution (which implicitly assumes independence across topics) with the logistic-normal distribution (which explicitly enables topic dependency via a covariance structure) for generating the document-topic probabilities. The other generative processes previously outlined for LDA are retained for CTM. Given this additional model complexity, the more involved mean-field variational inference algorithm is employed for CTM-parameter estimation, which necessitates many iterations for optimization purposes. CTM is consequently computationally more expensive than LDA, though this snag is far outweighed by the procurement of richer topics with explicit relationships acknowledged between them.
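Concretely, writing \(v\) for the Gaussian draw (our notation), the logistic transformation mentioned above maps the multivariate normal sample into document-topic probabilities as

\[
v \sim \text{MultNorm}(\mu, \Sigma), \qquad
\theta_k = \frac{\exp(v_k)}{\sum_{j=1}^{t} \exp(v_j)}, \quad k = 1, \dots, t,
\]

so that dependencies between topics enter through the off-diagonal entries of \(\Sigma\); a diagonal \(\Sigma\) induces no such correlation.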

Read in the data

Exploratory Data Analysis

Top 10 words across speeches (saved to saved_plots/top_10_words_across_speeches_chart.pdf).

Sentiment analysis

Topic modelling

LSA

(0, '0.267*"year" + 0.242*"government" + 0.198*"work" + 0.195*"south" + 0.188*"people" + 0.163*"country" + 0.145*"development" + 0.142*"national" + 0.140*"programme" + 0.134*"african"')
(1, '-0.169*"government" + 0.146*"south" + -0.142*"regard" + 0.135*"year" + -0.134*"people" + 0.115*"energy" + 0.114*"000" + -0.113*"shall" + -0.112*"ensure" + -0.102*"question"')
(2, '-0.140*"honourable" + -0.131*"programme" + 0.125*"pandemic" + -0.123*"continue" + 0.115*"new" + -0.110*"development" + -0.109*"rand" + 0.107*"great" + -0.106*"compatriot" + 0.102*"investment"')
(3, '-0.337*"alliance" + -0.240*"transitional" + -0.204*"party" + -0.204*"constitution" + -0.156*"zulu" + -0.155*"constitutional" + -0.131*"south" + -0.126*"concern" + -0.125*"election" + -0.122*"freedom"')
(4, '0.219*"shall" + -0.204*"people" + 0.148*"year" + -0.144*"alliance" + 0.130*"start" + -0.101*"government" + -0.097*"address" + -0.093*"transitional" + 0.088*"community" + 0.088*"citizen"')

pLSA (Probabilistic Latent Semantic Analysis)

[(0,
  '0.001*"year" + 0.000*"country" + 0.000*"south" + 0.000*"work" + 0.000*"programme" + 0.000*"government" + 0.000*"national" + 0.000*"development" + 0.000*"people" + 0.000*"continue"'),
 (1,
  '0.001*"year" + 0.000*"government" + 0.000*"south" + 0.000*"african" + 0.000*"people" + 0.000*"work" + 0.000*"country" + 0.000*"africa" + 0.000*"development" + 0.000*"programme"'),
 (2,
  '0.001*"government" + 0.001*"south" + 0.001*"year" + 0.001*"work" + 0.001*"people" + 0.001*"country" + 0.001*"development" + 0.001*"national" + 0.000*"programme" + 0.000*"public"'),
 (3,
  '0.001*"government" + 0.001*"work" + 0.001*"people" + 0.000*"year" + 0.000*"south" + 0.000*"ensure" + 0.000*"country" + 0.000*"programme" + 0.000*"national" + 0.000*"development"'),
 (4,
  '0.001*"year" + 0.001*"government" + 0.001*"people" + 0.001*"work" + 0.001*"south" + 0.000*"african" + 0.000*"country" + 0.000*"national" + 0.000*"development" + 0.000*"programme"')]

LDA (Latent Dirichlet Allocation)

[(0,
  '0.001*"year" + 0.001*"south" + 0.001*"government" + 0.001*"work" + 0.000*"african" + 0.000*"country" + 0.000*"programme" + 0.000*"people" + 0.000*"national" + 0.000*"new"'),
 (1,
  '0.000*"year" + 0.000*"work" + 0.000*"development" + 0.000*"government" + 0.000*"country" + 0.000*"national" + 0.000*"south" + 0.000*"programme" + 0.000*"continue" + 0.000*"energy"'),
 (2,
  '0.000*"government" + 0.000*"year" + 0.000*"work" + 0.000*"people" + 0.000*"south" + 0.000*"make" + 0.000*"development" + 0.000*"new" + 0.000*"country" + 0.000*"african"'),
 (3,
  '0.001*"year" + 0.001*"people" + 0.001*"government" + 0.001*"south" + 0.000*"work" + 0.000*"national" + 0.000*"african" + 0.000*"country" + 0.000*"public" + 0.000*"development"'),
 (4,
  '0.001*"government" + 0.001*"year" + 0.001*"people" + 0.001*"south" + 0.001*"work" + 0.001*"country" + 0.001*"development" + 0.001*"national" + 0.001*"programme" + 0.001*"ensure"')]

CTM (Correlated Topic Model)

Iteration: 0    Log-likelihood: -6.799771183041395
Iteration: 1    Log-likelihood: -6.513216470635273
Iteration: 2    Log-likelihood: -6.355882543279467
Iteration: 3    Log-likelihood: -6.2607084358537355
...
Iteration: 98   Log-likelihood: -5.899870508465152
Iteration: 99   Log-likelihood: -5.897041106380576
Topic #0: [('work', 0.08041610568761826), ('economic', 0.03734271973371506), ('address', 0.026227004826068878), ('provide', 0.02475070022046566), ('make', 0.02440333366394043), ('create', 0.023013869300484657), ('opportunity', 0.022840186953544617), ('increase', 0.02119019627571106), ('implement', 0.016848120838403702), ('build', 0.016066547483205795)]
Topic #1: [('ensure', 0.04742421582341194), ('include', 0.034458935260772705), ('business', 0.030891306698322296), ('nation', 0.030369214713573456), ('progress', 0.019492298364639282), ('land', 0.018535131588578224), ('000', 0.017490947619080544), ('focus', 0.016446763649582863), ('policy', 0.015576610341668129), ('national', 0.014967503026127815)]
Topic #2: [('south', 0.06261109560728073), ('african', 0.05913274735212326), ('continue', 0.04129508510231972), ('social', 0.03326813504099846), ('measure', 0.016857484355568886), ('plan', 0.0160547886043787), ('start', 0.015965601429343224), ('effort', 0.013468327932059765), ('especially', 0.011506184935569763), ('small', 0.01123862061649561)]
Topic #3: [('development', 0.0596558041870594), ('u', 0.047933220863342285), ('society', 0.0271799024194479), ('million', 0.024748550727963448), ('infrastructure', 0.02466171607375145), ('south', 0.018496504053473473), ('water', 0.01771499775350094), ('education', 0.01762816309928894), ('resource', 0.017020326107740402), ('matter', 0.01510997861623764)]
Topic #4: [('africa', 0.04934827238321304), ('service', 0.04361424595117569), ('regard', 0.02693343162536621), ('challenge', 0.02276322804391384), ('investment', 0.020330609753727913), ('implementation', 0.01711607724428177), ('parliament', 0.015117854811251163), ('like', 0.01477033831179142), ('critical', 0.013988425023853779), ('speaker', 0.0139015456661582)]
Topic #5: [('government', 0.08499310165643692), ('country', 0.06861845403909683), ('state', 0.03162387013435364), ('far', 0.026512207463383675), ('time', 0.026165654882788658), ('crime', 0.02252684347331524), ('security', 0.02079407311975956), ('high', 0.01810828410089016), ('say', 0.01724190078675747), ('capacity', 0.016722070053219795)]
Topic #6: [('programme', 0.05151817947626114), ('sector', 0.02975277788937092), ('support', 0.02738315798342228), ('life', 0.026066701859235764), ('make', 0.02439919114112854), ('project', 0.021941805258393288), ('place', 0.01843125745654106), ('important', 0.01781691052019596), ('government', 0.01658821851015091), ('change', 0.015271763317286968)]
Topic #7: [('people', 0.08195344358682632), ('national', 0.04559854790568352), ('improve', 0.03688393533229828), ('economy', 0.03582761809229851), ('need', 0.03547551482915878), ('great', 0.017694182693958282), ('energy', 0.01663786545395851), ('poverty', 0.015845628455281258), ('area', 0.013556942343711853), ('number', 0.013292863965034485)]
Topic #8: [('new', 0.04433754086494446), ('public', 0.043900296092033386), ('growth', 0.030870389193296432), ('community', 0.028334366157650948), ('job', 0.026323039084672928), ('process', 0.025623446330428123), ('honourable', 0.021163543686270714), ('child', 0.018102826550602913), ('president', 0.017753031104803085), ('billion', 0.017140887677669525)]
Topic #9: [('year', 0.11361329257488251), ('local', 0.01975954882800579), ('action', 0.01681307889521122), ('achieve', 0.0167264174669981), ('right', 0.015513165853917599), ('member', 0.01525318343192339), ('act', 0.014299913309514523), ('department', 0.013953269459307194), ('democratic', 0.01360662654042244), ('police', 0.013173321262001991)]

ATM (Author-Topic Model)

[(0,
  '0.009*"government" + 0.009*"year" + 0.007*"work" + 0.007*"people" + 0.007*"south" + 0.007*"country" + 0.006*"development" + 0.005*"african" + 0.005*"africa" + 0.005*"ensure"'),
 (1,
  '0.010*"government" + 0.009*"people" + 0.009*"year" + 0.007*"work" + 0.007*"south" + 0.007*"country" + 0.007*"africa" + 0.007*"african" + 0.006*"national" + 0.005*"make"'),
 (2,
  '0.011*"year" + 0.010*"government" + 0.009*"work" + 0.008*"south" + 0.007*"people" + 0.007*"country" + 0.005*"new" + 0.005*"national" + 0.005*"african" + 0.005*"sector"'),
 (3,
  '0.008*"year" + 0.007*"government" + 0.007*"work" + 0.007*"south" + 0.006*"people" + 0.006*"country" + 0.005*"development" + 0.004*"ensure" + 0.004*"national" + 0.004*"african"'),
 (4,
  '0.014*"year" + 0.010*"government" + 0.009*"south" + 0.007*"development" + 0.007*"people" + 0.007*"programme" + 0.007*"work" + 0.007*"national" + 0.006*"country" + 0.006*"african"')]

References

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41 (6): 391–407.
Hofmann, Thomas. 1999. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. SIGIR ’99. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/312624.312649.
Lafferty, John, and David Blei. 2005. “Correlated Topic Models.” In Advances in Neural Information Processing Systems, edited by Y. Weiss, B. Schölkopf, and J. Platt. Vol. 18. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2005/file/9e82757e9a1c12cb710ad680db11f6f1-Paper.pdf.